Module 03: Linear Regression
The University of Alabama
2026-03-23
Can you design a linear road that passes by all these houses equally?!


Price Equation
Price = 100 + 50 * (#Rooms)
WEIGHTS
Each feature is multiplied by a corresponding factor; these factors are the weights. In the formula above, the only feature is the number of rooms, and its weight is 50.
BIAS
The constant term that is not attached to any feature is called the bias. In this model, the bias is 100, and it corresponds to the base price of a house.
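As a small sketch, the pricing formula above can be written directly as a function, using the weight and bias values from the slide:

```python
def predict_price(num_rooms):
    weight = 50   # multiplies the feature (#Rooms)
    bias = 100    # constant term: the base price of a house
    return bias + weight * num_rooms

print(predict_price(4))  # 100 + 50 * 4 = 300
```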
How do machines learn this equation?


Linear Regression
Price = 100 + 50 * (#Rooms)
Slope
Measures how steep the line is.
Y-intercept
The height at which the line crosses the vertical (y) axis.

Linear Equation
A linear equation is the equation of a line: \(y=mx+b\)
How many lines can solve the problem?
Price = 100 + 50 \(\times\) (#Rooms)
Price = 100 + 50 \(\times\) (4) = 300
Some Questions:

Price = 30 * (#Rooms) + 1.5 * (Size) + 10 * (Schools Quality) - 2 * (Age) + 50
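This four-feature model can be sketched the same way; the example feature values below are made up for illustration:

```python
def predict_price(rooms, size, school_quality, age):
    # Each feature has its own weight; 50 is the bias.
    return 30 * rooms + 1.5 * size + 10 * school_quality - 2 * age + 50

# Hypothetical house: 3 rooms, size 100, school quality 8, 10 years old.
print(predict_price(3, 100, 8, 10))  # 90 + 150 + 80 - 20 + 50 = 350.0
```

Note that the Age weight is negative: older houses lower the predicted price.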

Inputs: A dataset of points.
Outputs: A linear regression model that fits that dataset.
\[ y' = mx + b \]
where:

- \(y'\): the price value that we're trying to predict.
- \(m\): the slope of the line.
- \(x\): the number of rooms, the value of our input feature.
- \(b\): the y-intercept.
In machine learning, we’ll write the equation for a model slightly differently:
\[ y' = w_1x_1 + w_0 \]
where:

- \(y'\): the predicted label (a desired output).
- \(w_1\): the weight of feature 1. Weight is the same concept as the "slope".
- \(x_1\): feature 1.
- \(w_0\) or \(b\): the bias (the y-intercept).
Suppose we selected the following weights and biases. Which of them has a lower loss?


Some Questions:

- How do we define a loss to measure the performance of the model?
- What initial values should we set for \(w_1\) and \(w_0\)?
- How do we update \(w_1\) and \(w_0\)?
Which model is better, and why? Which model has a lower loss?

The absolute loss is the sum of the absolute differences between the observed and predicted values.

\(| \text{observation}(x) - \text{prediction}(x) | = |(y-y')|\)

The squared loss is the sum of the squared differences between the observed and predicted values.

\([ \text{observation}(x) - \text{prediction}(x) ]^2 = [(y-y')]^2\)
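Both losses can be sketched in a few lines of Python; the sample observations and predictions below are made up:

```python
def absolute_loss(y, y_pred):
    # Sum of |y - y'| over all examples.
    return sum(abs(yi - pi) for yi, pi in zip(y, y_pred))

def squared_loss(y, y_pred):
    # Sum of (y - y')^2 over all examples.
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))

y_true = [300, 350, 500]
y_hat  = [310, 340, 480]
print(absolute_loss(y_true, y_hat))  # 10 + 10 + 20 = 40
print(squared_loss(y_true, y_hat))   # 100 + 100 + 400 = 600
```

Note how the squared loss penalizes the single 20-unit error more heavily than the two 10-unit errors combined.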

Mean Squared Error (MSE) is the average squared loss per example over the whole dataset.
\[ \text{MSE} = \frac{1}{N}\sum\limits_{(x,y)\in D}(y-\text{prediction}(x))^2 \]
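A minimal sketch of the MSE formula (the example values are made up):

```python
def mse(y, y_pred):
    # Average squared loss per example over the dataset.
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / len(y)

print(mse([300, 350, 500], [310, 340, 480]))  # (100 + 100 + 400) / 3 = 200.0
```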
Note that a gradient is a vector, so it has both of the following characteristics:

- Magnitude
- Direction

The gradient descent algorithm takes a step in the direction of the negative gradient.

The gradient descent algorithm moves from the previous point by a fraction of the gradient's magnitude, controlled by the learning rate \(\eta\), in the direction of the negative gradient.

\[ w_{new} = w_{old} - \eta \cdot \frac{d\ \text{loss}}{dw} \]
\[ \text{loss}(w_0, w_1) = [y - y']^2 = [y - (w_1x + w_0)]^2 \]
\[ \frac{d\ \text{loss}}{dw_1} = 2(y - y')(-x) = 2x(y' - y) \] \[ \frac{d\ \text{loss}}{dw_0} = 2(y - y')(-1) = 2(y' - y) \]
Procedure:
Pick a random data point \((x^{(i)},y^{(i)})\).
Compute Model Prediction \(y'^{(i)} = w_1x_1^{(i)} + w_0\)
Update the weights and bias using the following equations:
\[ \color{red}{w_1} = \color{red}{w_1} - \eta \frac{d\ \text{loss}}{dw_1} = \color{red}{w_1} - \eta 2x_1^{(i)}(\underbrace{y'^{(i)}-y^{(i)}}_{\color{red}{\text{error}}}) \] \[ \color{red}{w_0} = \color{red}{w_0} - \eta \frac{d\ \text{loss}}{dw_0} = \color{red}{w_0} - \eta 2(\underbrace{y'^{(i)}-y^{(i)}}_{\color{red}{\text{error}}}) \]



Small Learning Rate

Large Learning Rate

Well-Chosen Learning Rate (often around 0.01 in practice)
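A toy illustration of these three regimes (an assumption, not from the slides): minimizing \(f(w) = w^2\), whose gradient is \(2w\), with three different learning rates.

```python
def descend(w, lr, steps=20):
    # Gradient descent on f(w) = w^2; the gradient is 2w.
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(1.0, lr=0.001))  # small: barely moves toward the minimum at 0
print(descend(1.0, lr=0.1))    # well-chosen: gets close to 0 quickly
print(descend(1.0, lr=1.1))    # large: overshoots 0 and diverges
```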

\[ y' = w_1x_1 + w_2x_2 + \dots + w_nx_n + w_0 \]
\[ y' = \sum\limits_{i=0}^{n} w_i x_i \quad (\text{with } x_0 = 1) \]
Gradient Derivation \[\begin{align*} \frac{d\ell}{dw_i} &= \frac{d\ell}{dy'} \frac{dy'}{dw_i} \\ &= 2(y'-y) \cdot x_i \end{align*}\]
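A vectorized full-batch sketch of this gradient using NumPy; the dataset and hyperparameters are illustrative assumptions:

```python
import numpy as np

def batch_gd(X, y, lr=0.05, steps=2000):
    # Prepend a ones column so w[0] plays the role of the bias w_0 (x_0 = 1).
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        error = Xb @ w - y                        # y' - y for every example
        w -= lr * (2 * Xb.T @ error) / len(y)     # averaged 2 (y' - y) x_i
    return w

rooms = np.arange(1.0, 6.0).reshape(-1, 1)        # one-feature design matrix
prices = 100 + 50 * rooms.ravel()                 # Price = 100 + 50 * (#Rooms)
w = batch_gd(rooms, prices)
print(np.round(w, 2))  # approximately [100.  50.]
```

Averaging the gradient over the batch (dividing by \(N\)) makes the learning rate less sensitive to dataset size.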


Normal equation is a closed-form solution to the linear regression problem.
It provides an exact solution to the model parameters \(\textbf{W}\) that minimize the squared error.
Since we have: \[ \textbf{Y} = \textbf{X} \cdot \textbf{W} \]
If \(\textbf{X}\) were invertible, we could multiply both sides by its inverse: \[ \textbf{X}^{-1} \cdot \textbf{Y} = \textbf{W} \]
Since \(\textbf{X}\) is not a square matrix, we can’t find the inverse of \(\textbf{X}\).
We can use the following trick: multiply both sides by \(\textbf{X}^T\): \[ \textbf{X}^T \textbf{Y} = \textbf{X}^T \textbf{X} \textbf{W} \]
Now, multiply both sides by the inverse of \(\textbf{X}^T \textbf{X}\):
\[ (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{Y} = \textbf{W} \]
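The closed-form solution in NumPy; `np.linalg.solve` is used instead of an explicit matrix inverse, which is numerically preferable, and the data below is made up:

```python
import numpy as np

rooms = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(rooms), rooms])  # ones column for the bias
Y = 100 + 50 * rooms                               # from Price = 100 + 50 * (#Rooms)

# W = (X^T X)^{-1} X^T Y, computed by solving the linear system (X^T X) W = X^T Y.
W = np.linalg.solve(X.T @ X, X.T @ Y)
print(W)  # approximately [100.  50.]
```

Unlike gradient descent, this gives the exact minimizer in one step, but inverting or factorizing \(\textbf{X}^T\textbf{X}\) becomes expensive when the number of features is large.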



